Abstract

Model checkers search the space of possible program behaviors to
detect errors and to demonstrate their absence. Despite major advances
in reduction and optimization techniques, state-space search can still
become cost-prohibitive as program size and complexity increase. In
this paper, we present a technique for dramatically improving the
cost-effectiveness of state-space search techniques for error
detection using parallelism. Our approach can be composed with all of
the reduction and optimization techniques we are aware of to amplify
their benefits. It was developed based on insights gained from
performing a large empirical study of the cost-effectiveness of
randomization techniques in state-space analysis. We explain those
insights and our technique, and then show through a focused empirical
study that our technique speeds up analysis by factors ranging from 2
to over 1000 as compared to traditional modes of state-space search,
and does so with relatively small numbers of parallel processors.

1.  Introduction 

The first general tool for model checking programs [12] was developed
nearly ten years ago. The realization that variants of temporal logic
model checking algorithms could be applied to search the space of
possible program behaviors, to detect errors and demonstrate their
absence, has spurred a tremendous body of research in the past decade.
Much of this work has been oriented towards developing general
techniques for reducing the analysis cost through property-preserving
state-space reductions, e.g., [16, 5], and abstraction techniques,
e.g., [1, 13]. Another line of research has adapted model checking
algorithms and data structures to optimize error detection while
sacrificing the ability to demonstrate the absence of errors. Notable
success has been achieved along these lines for sequential program
analysis, for example, analysis and detection of classes of errors in
the Linux kernel [22], TCP/IP implementations [19], and widely-used
file system implementations [29].

Success in detecting concurrency-related errors on software of
realistic scale and complexity, however, has been more difficult to
achieve. The key complicating factor with concurrency is the need to
analyze program behavior under the set of possible schedules that
could be produced by the run-time system. In general, the set of all
possible executions grows exponentially with the number of threads of
control in a program. Concurrency errors, such as deadlocks and data
inconsistencies that arise due to data races, can be very difficult to
detect since they may only be exhibited on a small fraction of the
possible program executions.

Systematic search of a program’s feasible state-space, i.e., the set
of control and data configurations that can be reached along some
program execution, is attractive for these hard-to-find errors, since,
given sufficient time and memory, the error will eventually be
revealed. Unfortunately, even when the full complement of
state-of-the-art state-space reduction techniques is applied, there
are programs for which such an analysis will exhaust available time
and/or memory before detecting the error [6]. In this paper, we
address the challenge of providing additional reductions in analysis
cost by exploiting knowledge we have acquired studying program
state-space structure as it relates to error states, and using this
knowledge to create a technique that parallelizes the analysis.

Our insight on parallelization opportunities emerged from our recent
investigation of how the order in which a state-space is searched
influences the cost and effectiveness of detecting errors [6]. Our
empirical study of 56 multi-threaded Java programs showed that random
variations in the search order give rise to enormous variations in the
cost to find an error across a space. It was common, for example, to
find programs where, given a few hundred random searches, the fastest
search order outperformed the slowest by four or five orders of
magnitude.

Ideally, to improve the efficiency of the error detection process, one
would like to guide the model checker towards regions of the program
state-space that contain errors, and avoid regions that are free of
errors. Distinguishing such regions without first exploring them,
however, is beyond the current state-of-the-art in search heuristics.
Instead, we have developed a technique, which we call Parallel
Randomized State-space Search (PRSS), that runs multiple parallel
randomized state-space searches, and terminates all searches when the
first one finds an error. The intuition behind PRSS is that by
sampling different regions of the state-space, there is a good chance
that a region containing errors will be found. In addition, by
exploring regions in parallel, the time required to search regions
that do not have errors is mitigated. Our evaluation of the PRSS
technique on the most challenging of the multi-threaded Java programs
from our previous study demonstrates that PRSS can reduce the cost to
find an error using state-space search by factors ranging from 2 to
well over 1000, and that this reduction can be achieved using a
relatively small number of parallel processors, ranging from 5 to 20.

In addition to improving the cost to find an error, PRSS has a number
of other benefits. For example, PRSS is a general technique that can
be composed with existing reduction, abstraction and heuristic
techniques to further enhance the gains achieved by those techniques.
Furthermore, it appears to be broadly applicable across a range of
programs. Its performance benefits accrue when run on numbers of
processors that most developers will have ready access to, for
example, in a handful of multi-core workstations. In principle, PRSS
could be implemented using any explicit state model checker or similar
state-space analysis tool. In this paper we report on results using
version 3.1.2 of Java PathFinder [27].

The contributions of this paper lie in (i) the presentation of a
practical and cost-effective technique for detecting hard-to-find
errors in concurrent programs, which we detail in the next section,
and (ii) the results of an empirical study that provide evidence of
the effectiveness of the PRSS technique instantiated for Java
PathFinder as compared to using the default mode of analysis with Java
PathFinder over a range of non-trivial multi-threaded Java programs.
Section 4 describes our study design and setup, and we present and
discuss the results of the studies in Section 5. We discuss related
work in Section 6 and describe plans for further assessing the
effectiveness of PRSS in Section 7.


2.  Motivation  for  PRSS 

In previous work [6], we discovered that randomizing the order of
program state-space search can sometimes lead a model checker to
locate an error state very quickly, outperforming a model checker’s
default search order. This is not very surprising. Given enough
randomized searches one is bound to find a search that detects errors
more quickly than the default search order, which is generally defined
without regard to program structure or the type of error.

Figure 1. Search Cost Distributions

What was surprising, however, was the degree of variation in the cost
of search across different programs. Some programs exhibited cost
distributions that were flat, indicating that searches of varying cost
were equally likely, some were clustered, indicating that all searches
within a given group had similar cost, some were close to Gaussian,
and some were bipolar, i.e., two clusters at the low and high end of
the cost scale. Figure 1 illustrates cost distributions for two of the
programs in our study utilizing histograms. The x-axis represents the
number of states visited by the model checker and each bar represents
the percentage of 5000 randomized depth-first search runs performed on
the given program. With DEOS and ReplicatedWorkers we observed
variations in cost that spanned one or more orders of magnitude; this
is representative of the population of programs we studied. The key
observation we made was that despite this enormous range, there were
always some relatively low-cost runs, on the order of 10s or 100s of
thousands of visited states, that detected the error. For example,
DEOS and ReplicatedWorkers have 18% and 17% of
their runs that found the error in this low-cost region, respectively.

This trend holds up across all of the 56 programs we studied in [6];
we also found at least some runs that were significantly less
expensive than the mean cost across the set of randomized runs we
explored. From this observation, we conjectured that by performing
enough randomized search runs we would eventually be able to find a
run that can find an error quickly. We leverage this conjecture by
running the searches in parallel to reduce the wall-clock time for
detecting errors, which led to our Parallel Randomized State-space
Search technique.


3.  The  PRSS  Technique

The PRSS algorithm is an integration of classic depth-first search
(DFS) to find error states, randomized DFS, and parallelized search.
We explain each of these aspects in turn by highlighting portions of
the overall algorithm.

    basicDFS(s0)
     1  seen := {s0}
     2  push(stack, s0)
     3  DFS(s0)
    end basicDFS()

    DFS(s)
     4  workSet := enabled(s)
     5  for each α ∈ workSet do
     6      s' := α(s)
     7      if error(s') then
     8          counterexample := stack
     9          exit
    10      if s' ∉ seen then
    11          seen := seen ∪ {s'}
    12          push(stack, s')
    13          DFS(s')
    14          pop(stack)
    end DFS()

Figure 2. Basic DFS for first error state

    randDFS(seed)
     1  seen := {s0}
     2  init-rand(seed)
     3  push(stack, s0)
     4  DFS(s0)
    end randDFS()

    SHUFFLE(seq)
     5  for each i := 0 ... |seq|
     6      r := i + (rand() % (|seq| - i))
     7      t := seq[r]
     8      seq[r] := seq[i]
     9      seq[i] := t
    end SHUFFLE()

Figure 3. Randomized DFS

    PRSS(N, seed)
     1  init-rand(seed)
     2  for each i := 1 ... N
     3      start(randDFS(rand()), i)
     4  while (true)
     5      for each j := 1 ... N
     6          if (done(j)) then
     7              first := j
     8              break while @4
     9          endif
    10  for each k := 1 ... N
    11      if (k ≠ first) then
    12          stop(k)
    13  print(counterexample_first)
    end PRSS()

Figure 4. Parallel randomized DFS

3.1.  Depth-first  State-space  Search

Our analysis involves a stateful search of a program’s state-space.
Researchers have proposed the use of stateless search, e.g., [24], but
our experience using such searches indicated that it is not
cost-effective for programs with hard-to-find bugs, i.e., where the
percentage of executions of the program that exhibit the error is near
zero. For example, on the Elevator program in our study, in over 3
hours of run-time, 10,000 randomized stateless searches were unable to
detect the error, whereas our randomized stateful searches always
found the error with a mean run-time of 6 minutes. We used depth-first
search (DFS) as the basis for PRSS in this paper; we plan to explore
the use of variants of breadth-first search in future work.



Abstractly, we view a program as a guarded transition system and
analyze transition sequences. A guarded transition system consists of
a set of variables, which for our purposes are coalesced into a single
composite state variable s, and a set of guarded transitions which
atomically test, with predicate ϕ, the current state and update the
state by executing a transition, α, i.e., if ϕ(s) then s = α(s). The
initial values of program variables are used to define an initial
state, s0.

Figure 2 presents the basic DFS algorithm that generates the program
state-space, terminating when it finds an error or finds all reachable
states. basicDFS initializes the set of states seen in the search, and
the stack that stores the current path in the state-space being
analyzed, and then starts a recursive chain of DFSs from the initial
state. Lines 4-14 comprise a step in the DFS search. On line 4,
enabled(s) returns the set of transitions, α, whose guard, ϕ, is true
in the given state. Line 5 iterates through the set of enabled
transitions and we assume that the order of iteration is fixed, i.e.,
it is the same for every run of the algorithm, which is the default
for all existing state-space analysis tools that we are aware of.
Lines 7-9 test if an error state has been reached, and if so, record
the current DFS stack, which encodes the path under analysis, as a
counterexample and exit.
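
The basic search of Figure 2 can be sketched in Python as follows. This is only an illustrative sketch: the names basic_dfs, toy_enabled and toy_is_error are hypothetical, the toy transition system (two bounded counters) is invented for the example, and, as a small deviation from Figure 2, the returned path includes the error state itself.

```python
from typing import Callable, Hashable, Iterable, List, Optional

State = Hashable
Transition = Callable[[State], State]


def basic_dfs(
    s0: State,
    enabled: Callable[[State], Iterable[Transition]],
    is_error: Callable[[State], bool],
) -> Optional[List[State]]:
    """Return the path from s0 to the first error state found, or None."""
    seen = {s0}      # line 1: states already visited
    stack = [s0]     # line 2: current path through the state-space

    def dfs(s: State) -> Optional[List[State]]:
        for alpha in enabled(s):          # lines 4-5: fixed iteration order
            s_next = alpha(s)             # line 6
            if is_error(s_next):          # lines 7-9
                return stack + [s_next]   # path, including the error state
            if s_next not in seen:        # lines 10-14
                seen.add(s_next)
                stack.append(s_next)
                found = dfs(s_next)
                if found is not None:
                    return found
                stack.pop()
        return None

    return dfs(s0)                        # line 3


# Hypothetical toy system: two counters, each may be incremented up to 2.
def toy_enabled(s):
    transitions = []
    if s[0] < 2:
        transitions.append(lambda st: (st[0] + 1, st[1]))
    if s[1] < 2:
        transitions.append(lambda st: (st[0], st[1] + 1))
    return transitions


def toy_is_error(s):
    return s == (1, 2)


if __name__ == "__main__":
    print(basic_dfs((0, 0), toy_enabled, toy_is_error))
```

Because the iteration order over enabled transitions is fixed, every run of this sketch visits states in the same order; the recursion mirrors the figure and would need an explicit stack for deep state-spaces.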

3.2.  Randomized  State-space  Search 

Researchers in randomized testing [2] have explored the use of
randomized sampling of a program’s input domain to detect errors. In
contrast, we randomize the sequence of scheduling decisions that are
made by the underlying run-time system in executing a program.

Randomization of DFS is achieved by applying a Fisher-Yates shuffle
[17], lines 5-9 of Figure 3, to the sequence of enabled transitions at
each state explored. Each time the algorithm executes, the order in
which enabled transitions are explored on line 10 is randomized. This
approach to randomization has the advantage that reduction techniques
that operate by modifying the set of enabled transitions, such as
partial-order reductions for Java [5], can be applied first and then
the sequence in which the remaining transitions are explored is
randomized. Randomization in the shuffle follows a pseudo-random
sequence whose seed is passed as a parameter to randDFS, in Figure 3,
and used to initialize the sequence on line 2. When an error is
detected the analysis returns the seed along with the sequence of
program transitions as a counter-example (line 14). This allows replay
of randomized runs to analyze counter-examples in detail.
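
A minimal sketch of the seeded, shuffled search described above, under stated assumptions: the names shuffle, rand_dfs, toy_enabled and toy_is_error are hypothetical, the toy system is invented for illustration, and Python's random.Random stands in for the pseudo-random sequence. It is not the Java PathFinder implementation.

```python
import random


def shuffle(seq, rng):
    """Fisher-Yates shuffle of seq in place, driven by the generator rng."""
    n = len(seq)
    for i in range(n - 1):
        r = i + rng.randrange(n - i)   # uniform index in [i, n - 1]
        seq[i], seq[r] = seq[r], seq[i]


def rand_dfs(s0, enabled, is_error, seed):
    """Randomized DFS; returns (seed, path), path is None if no error is found."""
    rng = random.Random(seed)          # the seed fixes the whole search order
    seen = {s0}
    stack = [s0]

    def dfs(s):
        work = list(enabled(s))        # reductions on the enabled set apply first
        shuffle(work, rng)             # then the remaining order is randomized
        for alpha in work:
            s_next = alpha(s)
            if is_error(s_next):
                return stack + [s_next]
            if s_next not in seen:
                seen.add(s_next)
                stack.append(s_next)
                found = dfs(s_next)
                if found is not None:
                    return found
                stack.pop()
        return None

    return seed, dfs(s0)


# Hypothetical toy system: two counters, each may be incremented up to 2.
def toy_enabled(s):
    transitions = []
    if s[0] < 2:
        transitions.append(lambda st: (st[0] + 1, st[1]))
    if s[1] < 2:
        transitions.append(lambda st: (st[0], st[1] + 1))
    return transitions


def toy_is_error(s):
    return s == (1, 2)
```

Returning the seed with the path is what makes replay possible: calling rand_dfs again with the same seed retraces the same search and yields the same counter-example.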

3.3.  Parallel  State-space  Search 

PRSS, shown in Figure 4, accepts a parameter (N) that controls the
degree of parallelism to be applied in the analysis and a parameter
(seed) that gives users control over the randomization in the
algorithm; passing the same seed provides reproducibility whereas
passing a random sequence of seeds provides effective randomization.
The analysis starts N copies of a randomized DFS (lines 1-3) each with
a different seed that is calculated based on a pseudo-random sequence
that is initialized with the seed parameter.

There are many different implementation strategies that can be applied
to distribute jobs to nodes in a parallel machine or distributed
cluster. We describe a polling approach based on three abstract
primitives: start(m, i) executes method m on machine i, done(i) polls
to determine if the job on machine i is complete, and stop(i)
terminates the job on machine i. It would be a simple matter to map
the logic of lines 4-9 to primitives that block until job completion
rather than use this polling approach.

When a job completes it will be detected within N calls to done and
its index is then recorded as the first to complete (line 7) and the
polling loop is exited (line 8); we are not concerned with the minor
differences in run-time that would arise due to races among jobs
completing at approximately the same time. Lines 10-12 shut down all
other executing jobs and the counterexample from the first job is
printed.
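
The control structure of Figure 4 can be sketched as follows, with several assumptions made for illustration: threads from Python's concurrent.futures stand in for machines (so the sketch shows the structure, not real parallel speedup), a shared threading.Event plays the role of stop, and the blocking wait mentioned above replaces the polling loop. The names prss, rand_dfs, toy_enabled and toy_is_error are hypothetical.

```python
import random
import threading
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def rand_dfs(s0, enabled, is_error, seed, stop):
    """Iterative randomized DFS that gives up once the stop event is set."""
    rng = random.Random(seed)
    seen = {s0}

    def shuffled(s):
        work = list(enabled(s))
        rng.shuffle(work)              # stands in for SHUFFLE of Figure 3
        return iter(work)

    path = [s0]                        # current path through the state-space
    frames = [shuffled(s0)]            # untried transitions for each path state
    while frames and not stop.is_set():
        alpha = next(frames[-1], None)
        if alpha is None:              # state fully explored: backtrack
            frames.pop()
            path.pop()
            continue
        s_next = alpha(path[-1])
        if is_error(s_next):
            return seed, path + [s_next]
        if s_next not in seen:
            seen.add(s_next)
            path.append(s_next)
            frames.append(shuffled(s_next))
    return seed, None


def prss(n, seed, s0, enabled, is_error):
    """Run n independent randomized searches; return the first one to finish."""
    master = random.Random(seed)                              # line 1
    seeds = [master.randrange(2**32) for _ in range(n)]
    stop = threading.Event()
    with ThreadPoolExecutor(max_workers=n) as pool:
        jobs = [pool.submit(rand_dfs, s0, enabled, is_error, sd, stop)
                for sd in seeds]                              # lines 2-3
        done, _ = wait(jobs, return_when=FIRST_COMPLETED)     # lines 4-9
        stop.set()                                            # lines 10-12
    return next(iter(done)).result()                          # line 13


# Hypothetical toy system: two counters, each may be incremented up to 2.
def toy_enabled(s):
    transitions = []
    if s[0] < 2:
        transitions.append(lambda st: (st[0] + 1, st[1]))
    if s[1] < 2:
        transitions.append(lambda st: (st[0], st[1] + 1))
    return transitions


def toy_is_error(s):
    return s == (1, 2)
```

Each search keeps its own seen set and path, so the jobs share nothing except the stop signal. The winning job's seed is returned with its path, so the counterexample can be replayed by a single sequential run.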

There are several notable aspects of this algorithm. (1) Unlike many
existing approaches to parallelization of state-space search, which we
discuss in detail in Section 6, PRSS is embarrassingly parallel [9].
The N parallel randomized depth-first searches are performed
completely independently such that state information collected and
used by each search job is kept local to the job and need not be
exposed in any way to the other parallel searches. This eliminates the
need for costly inter-process communication and coordination between
jobs.

(2) PRSS runs multiple simultaneous state-space searches in distinct
portions of the state-space; the likelihood of two searches ending up
in the same region of the state-space is low. By using multiple
randomized searches to explore a single state-space, the chance that
one search will explore a region that is relatively dense with error
states is increased over a single search, and the penalty for
searching in a region that is free of errors is mitigated since a
sibling search may be making progress at the same time.
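
A back-of-the-envelope model of this effect, offered only as an illustration: assume the N searches behave like independent draws from a cost distribution in which a fraction p of runs is low-cost (for instance the 18% reported for DEOS in Section 2). Independence is an idealization made here, not a claim of the study, and the function names are hypothetical.

```python
import math


def p_at_least_one_cheap(p_cheap, n):
    """Chance that at least one of n independent searches is a low-cost run."""
    return 1.0 - (1.0 - p_cheap) ** n


def searches_needed(p_cheap, target):
    """Smallest n for which p_at_least_one_cheap(p_cheap, n) >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_cheap))


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(n, round(p_at_least_one_cheap(0.18, n), 3))
    print(searches_needed(0.18, 0.95))
```

Under these assumptions the chance of hitting a low-cost run rises quickly with N and then flattens, which is consistent in spirit with gains being reported at small numbers of processors.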

(3) PRSS leverages all of the optimizations applied to the underlying
DFS algorithm and its precision is limited only by the precision of
the underlying DFS. Note that it neither creates additional behavior
nor removes existing behavior in the state-space and therefore does
not affect the soundness of the underlying search technique.

4.  Study 

The purpose of our study was to evaluate the cost and effectiveness of
PRSS for error detection. We set the study in the context of a
collection of Java programs containing concurrency-related defects,
and compared the performance and fault detection capabilities of PRSS
at various degrees of parallelism against JPF’s default search
settings. We used JPF’s RandomOrderScheduler to implement the randDFS
algorithm in Figure 4. The specific PRSS configurations evaluated,
specified as the number of parallel randomized searches, are described
in Section 4.1.1. For this study we investigated the following
research questions:

RQ1: (Cost Reduction) Does there exist a feasible configuration of
PRSS that can detect a program error more quickly than performing a
state-space search using the default search order? Where, by feasible,
we mean a number of parallel processing nodes that might reasonably be
available to a software testing organization.

RQ2: (Parallel Speedup) Does the performance of PRSS improve with
increased parallelism? If so, is there a point of diminishing return?

Subject                               Source  Parameters                 Error                # Threads  Classes  SLOC
BoundedBuffer(3,6,6,1)                [4]     modCount, bufferSize,      Deadlock             13         5        65
                                              #producers, #consumers
Daisy()                               [21]    none                       Assertion Violation  3          21       744
DEOS(false)                           [10]    abstracted?                Assertion Violation  4          24       838
Elevator()                            [7]     none                       ArrayIdxOOBExcpn     4          12       934
RaxExtended(4,3,false)                [10]    gc, we, envFirst?          Assertion Violation  6          11       127
ReplicatedWorkers(5,2,0.0,10.7,0.05)  [4]     #workers, #items, min,     Deadlock             6          14       304
                                              max, epsilon
RWNoDeadLckCk(2,2,100)                [4]     #readers, #writers, bound  Assertion Violation  5          6        103

Table 1. Study artifacts


RQ3: (Fault Detection) Can PRSS be used to detect an error in programs
where the default searcher fails because of insufficient time or
space?

4.1.  Characterization  Variables 

4.1.1  Independent  Variable 

To answer our research questions, we manipulated one independent
variable: the number of parallel randomized state-space searches. For
practical purposes, this measure represents the number of parallel
processors or nodes used when applying the PRSS technique. Because
there is no fixed upper bound on the number of parallel searches one
might perform, and because it would be impractical for a study such as
this to attempt to test every potential node configuration, we chose
11 different configurations including 1, 2, 5, 10, 15, 20, 25, 50,
100, 500, and 1000 parallel nodes. Our goal was to select a set of
practical values that includes a sufficient number and range of data
points to be able to identify trends in cost and performance.


4.1.2  Dependent  Variables 

The dependent variable for RQ1 and RQ2 is tool performance. We measure
performance in terms of the number of program states explored. We use
this measure because it is platform-independent and it is a common
metric for evaluating state-space exploration tool performance, such
as model checker performance. In JPF, this metric is referred to as
the number of new states.

For RQ3, the dependent variable is fault detection capability. This
variable is simply a measure of whether the technique detects the
program fault or not. Each technique is tested under the same
conditions (i.e., resource constraints) which means that the
opportunity to detect the program error is equal for all techniques.


4.2.  Artifacts

Seven unique concurrent Java programs form the collection of artifacts
for our study. All programs exhibit a single concurrency error
represented as a deadlock, an exception, or an assertion violation.
Table 1 describes the programs.

The programs were selected from the population of 56 parameterized
artifacts used in [6]. Because this study is focused on hard-to-find
defects, we limited the selection of artifacts to all but one of the
programs that were classified as “realistic.” This class of programs
contains Java artifacts that perform a computation over rich data
structures, many of which have been previously used in slightly
different forms to evaluate Java state-space search techniques in the
literature. The only “realistic” program from that study that was not
used is AlarmClock. This particular program was omitted from the
current study because, although it is interesting in some contexts,
its small state-space does not challenge state-of-the-art search
techniques.

4.3.  Study  Design  and  Setup 

To conduct this study, we needed to evaluate the artifacts on each of
the parallel search configurations. This required a minimum of 1,728
randomized searches per artifact, i.e., the sum of the configuration
sizes mentioned in Section 4.1.1 per artifact.

Based on our previous experience, where we observed that program
state-spaces can be extremely large and that the number of states
visited before detecting the program defect can vary greatly, we chose
to evaluate each artifact 50 times for each parallel search
configuration. This meant we required 86,400 searches per artifact,
and 604,800 searches total for seven artifacts.
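
The search counts above follow directly from the configuration sizes of Section 4.1.1; a short check, with variable names chosen here for illustration:

```python
# Configuration sizes from Section 4.1.1 (number of parallel searches).
CONFIGS = [1, 2, 5, 10, 15, 20, 25, 50, 100, 500, 1000]
TRIALS = 50       # evaluations per configuration
ARTIFACTS = 7     # programs in the study

per_pass = sum(CONFIGS)                   # one pass over every configuration
per_artifact = per_pass * TRIALS          # 50 trials of every configuration
total = per_artifact * ARTIFACTS          # all seven artifacts

if __name__ == "__main__":
    print(per_pass, per_artifact, total)
```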

To control the cost of conducting the study, we chose instead to
produce a pool of 5000 random searches for each artifact, from which n
searches would be randomly selected to represent a configuration of n
parallel searches for each experiment. The pool size of 5000 was
selected based on our previous experience.

The following steps were then performed to obtain our study results.
For each program artifact:

1. We performed 5000 random searches using JPF version 3.1.2 on a
cluster of dual-Opteron 250s running at 2.4 GHz with 16GB of memory
and running Fedora Core 3 Linux. Each randomized search used a
distinct seed generated from a pseudo-random sequence, and was limited
to one hour of execution time and 2GB of memory, with the exception of
BoundedBuffer. Higher bounds (14GB and four hours) were used for
BoundedBuffer in order to evaluate the PRSS technique on a program
with a larger state-space.

2. To simulate a run of n parallel randomized searches for a given
artifact, we randomly sampled, with replacement, the pool of 5000
randomized searches for that artifact n times. We repeated this
sampling process to produce a total of 50 trials to account for
potential variation across samples.

3. From each sample of size n, we chose the search with the shortest
time to represent the search that would have completed first if the
searches had actually been performed in parallel. In the case of a
tie, one search result was selected from the group.
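
Steps 2 and 3 can be sketched as a small resampling routine. This is an illustrative sketch only: simulate_prss is a hypothetical name, each pool entry is assumed to be the cost of one completed randomized search, and the pool used in the demo is synthetic data, not the study's measurements.

```python
import random
import statistics


def simulate_prss(pool, n, trials=50, seed=0):
    """Estimate the cost of n parallel searches by resampling a pool of runs.

    For each trial, draw n costs from the pool with replacement and keep the
    smallest, i.e. the search that would have finished first.
    Returns (mean, standard deviation) over the trials.
    """
    rng = random.Random(seed)
    firsts = [min(rng.choices(pool, k=n)) for _ in range(trials)]
    return statistics.mean(firsts), statistics.stdev(firsts)


if __name__ == "__main__":
    # Synthetic pool of 5000 costs with a long tail, purely for illustration.
    gen = random.Random(1)
    pool = [int(gen.lognormvariate(10, 2)) + 1 for _ in range(5000)]
    for n in (1, 2, 5, 10, 15, 20, 25):
        mean, sd = simulate_prss(pool, n)
        print(n, round(mean), round(sd))
```

Taking the minimum of each sample is what stands in for "first job to complete"; the per-configuration mean and standard deviation over the 50 trials are the quantities plotted in the results.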

4.4.  Threats  to  Validity

In this section, we describe the internal, external, construct and
conclusion threats to the validity of this study. We also include the
approaches we designed to minimize the impact of these threats on our
findings.

Internal threats. Setting different bounds for the model checker can
clearly impact the findings. For example, unlimited time and memory
would allow all searches to find the program defect. Conversely, for
some searches, one might expect that increasing the time or memory
bound might simply allow the analysis to take longer to exhaust those
resources. Our choice for upper bound on time and memory for JPF was
primarily meant to be consistent with settings used in other recent
studies.

External threats. Our study was performed on a single state-space
search tool, JPF version 3.1.2. Different versions of JPF or different
state-space analysis tools may yield different results. Replicated
studies with different versions of JPF or with different tools would
address this threat. The artifacts chosen for this study may also
affect the results. We selected artifacts classified as “realistic”
programs from the population of artifacts used in [6] in an attempt to
evaluate the effectiveness of PRSS on detecting hard-to-find defects.
We do not know, however, if these artifacts and the defects they
contain are truly representative of hard-to-find defects in the
broader population of multi-threaded Java programs.

Construct threats. The measures we selected for this study provide
what we believe is a reasonable way to evaluate its results. However,
other measures may provide perspectives that we did not consider.
Nevertheless, to be consistent with other studies and more relevant to
the model checking community, we decided to use the number of new
states, which is platform-independent and commonly used in evaluating
model checking and other state-space analysis tools.

Conclusion threats. In order to execute this study, we chose to
simulate each parallel, randomized search for a given artifact by
randomly selecting a search from the pool of 5000 randomized searches
performed on that artifact. It is possible that the pool size of 5000
randomized searches per artifact is not sufficiently diverse to
accurately represent the set of all feasible randomized searches for
that artifact. It is also possible that the number of trials (50)
performed on each artifact for each parallel randomized search
configuration does not accurately represent the set of feasible
results. We attempted to mitigate these threats by choosing the pool
size and number of trials based on the experiences gained in our
previous study. For example, in our previous study, 500 randomized
searches produced a stable variance in the number of states visited to
first error for some of our artifacts but not all. We therefore set
the pool size at 5000, an order of magnitude larger, in an attempt to
achieve a more stable variance in all artifacts. Overall, given the
exploratory nature of the study at this point, we do not consider
limited pool size to be a major source of concern.

5.  Study  Results 

Figure 5 provides a graphical depiction of the results of our study in
a series of seven plots, one per program. Within each plot, for each
PRSS configuration, we show the mean cost, in new states explored, and
the standard deviation in cost over the 50 trials we evaluated. We
only show data up to 25 parallel nodes for PRSS for all of the
programs except BoundedBuffer, where we show data up through 50
parallel nodes. Only these configurations were included in the graphs
because the trends are nearly flat and unchanging beyond these points.

The  plots  include  three  additional  reference  lines.  De¬ 
fault  States  and  Min  States  represent  the  Default  and  Min¬ 
imum  values  from  Table  2,  respectively.  Note  that  in  some 
cases  the  default  value  is  off  the  scale  and  therefore  not 
shown,  because  the  default  search  either  ran  out  of  mem¬ 
ory  or  exceeded  the  time  limit.  The  100%  Runs  Completed 
line  indicates  the  point  at  which  all  of  the  50  trials  for  the 
PRSS  configurations  to  the  right  of  the  line  completed  and 
found  the  error;  to  the  left  of  this  line,  at  least  one  of  the  50
trials  of  parallel  searches  at  a  given  PRSS  node  configuration
either  ran  out  of  memory  or  exceeded  the  time  bound.

Figure  5.  Scaled  PRSS  performance

Artifact                                   Default  Found w/    Minimum     Maximum    PDR     PDR Mean        PDR
                                            States  Default      States      States  Nodes       States    Speedup
BoundedBuffer(3,6,6,1)                     2603200  OM             1702     1573469     20       313290       >8.3
Daisy()                                     101816  ✓               140       99865      5        37715        2.7
DEOS(false)                                 260039  ✓             75919     1465215     15       143860        1.8
Elevator()                                11743698  TO            57082     6751854     10       116960     >100.4
RaxExtended(4,3,false)                     3470398  TO               41     3176159      5          176   >19718.1
ReplicatedWorkers(5,2,0.0,10.7,0.05)       6231840  TO              372     6642260     15       226790      >27.5
RWNoDeadLckCk(2,2,100)                     3147356  ✓                40    24796751     10         1847     1704.0

Table  2.  Results  Summary

In  Table  2  we  summarize  the  results  of  our  study.  We 
show  the  number  of  states  explored  by  the  default  search 
and  indicate  if  the  search  completed  (✓),  timed  out  (TO)
or  ran  out  of  memory  (OM).  The  Minimum  and  Maximum 
values  are  the  observed  minimum  and  maximum  number 
of  states  explored  by  the  random  searches  in  the  pool.  The 
Point  of  Diminishing  Returns  (PDR)  values  are  explained 
in  Section  5.2.  In  the  remainder  of  this  section,  we  consider 
each  of  the  research  questions  in  turn. 

5.1.  RQ1  -  Cost  Reduction

The  plots  in  Figure  5  clearly  indicate  that,  for  our  study, 
there  is  always  at  least  one,  and  often  many,  feasible  PRSS 
configurations  capable  of  detecting  an  error  more  quickly 
than  the  default  search.  In  the  case  where  the  default  search 
does  not  complete  execution  (i.e.,  times  out  or  runs  out  of 
memory),  this  observation  still  holds  because  the  number 
of  states  explored  by  the  default  search,  as  presented  in  Ta¬ 
ble  2,  can  be  viewed  as  an  under-approximation  of  the  ac¬ 
tual  number  of  states  that  would  need  to  be  explored  in  order 
to  detect  the  error. 

For  Elevator  and  RaxExtended,  even  running  a
single  randomized  DFS  finds  the  error  in  all  of  the  trials 
we  performed  with  one  node.  This  is  remarkable  given  the 
size  of  the  state  space  searched  by  the  default  run  before 
running  out  of  resources. 

For  Daisy  and  DEOS,  simply  performing  a  single  ran¬ 
domized  DFS  may  not  yield  a  more  efficient  analysis  ac¬ 
cording  to  our  experiment;  however,  increasing  the  paral¬ 
lelism  to  2  and  15  nodes,  respectively,  for  these  examples 
beats  the  default  in  all  50  trials. 

RWNoDeadLckCk  shows  a  similar  trend,  but  with  the
additional  fact  that  below  10  parallel  randomized  DFSs 
there  is  a  possibility  that  one  or  more  randomized  searches 
fails  to  complete  -  even  when  the  default  completes.  At 
10  nodes,  however,  PRSS  beats  the  performance  of  de¬ 
fault  by  a  factor  of  1700,  has  almost  no  variation  in  this 
performance  across  the  50  trials,  and  never  fails  to  find
the  error  in  our  experiment.  ReplicatedWorkers  and
BoundedBuffer  show  similar  trends  where  a  degree  of 
parallelism  of  25  and  50,  respectively,  is  needed  to  achieve 
100%  error  detection  according  to  our  study.  Based  on 
these  findings,  it  seems  clear  that  there  exist  feasible  con¬ 
figurations  of  PRSS  that  can  detect  a  program  error  more 
quickly  than  performing  a  state-space  search  using  the  de¬ 
fault  search  order. 

5.2.  RQ2  -  Parallel  Speedup 

The  plots  in  Figure  5  share  a  characteristic  shape.  For 
all  artifacts,  the  curve  has  a  downward  trend  as  parallelism 
is  increased  and  a  leveling  off  towards  higher  degrees  of 
parallelism.  These  plots  confirm  that  the  performance  of 
PRSS  improves  with  increased  parallelism.  Furthermore,  as 
parallelism  increases,  the  variation  in  performance  observed 
decreases.  This  is  because  a  larger  degree  of  parallelism  ef¬ 
fectively  increases  the  sample  size  of  the  set  of  randomized 
searches  and  the  likelihood  of  finding  an  inexpensive  search 
increases. 
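
The sampling argument above can be made concrete with a small calculation. The sketch below assumes that a PRSS run with k nodes behaves like k independent draws, with replacement, from a pool of randomized-search costs, and that its cost is the smallest draw; the function name and the pool values are made up for illustration and are not data from the study.

```python
def expected_min_cost(pool, k):
    """Exact expected minimum of k costs drawn uniformly, with
    replacement, from pool (an assumed model of a k-node PRSS run)."""
    costs = sorted(pool)
    n = len(costs)
    total = 0.0
    for i, cost in enumerate(costs):
        # P(minimum has rank i) =
        #   P(all k draws have rank >= i) - P(all k draws have rank > i)
        p = ((n - i) / n) ** k - ((n - i - 1) / n) ** k
        total += cost * p
    return total


# Made-up pool: a few cheap searches and one very expensive one.
pool = [40, 100, 1000, 50000]
curve = [expected_min_cost(pool, k) for k in (1, 2, 5, 10, 25)]
```

With one node the expected cost is simply the pool mean; as k grows, the expected cost falls toward the cheapest search in the pool, which matches the downward and then flattening shape described for the plots.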

By  inspecting  these  plots,  we  are  able  to  approximate  a 
Point  of  Diminishing  Returns  (PDR)  which  is  an  estimate  of 
the  degree  of  parallelism  beyond  which  additional  compu¬ 
tational  resources  provide  increased  performance  that  is  not 
justified  by  those  extra  resources.  Our  definition  of  PDR  is 
informal  and  intuitive:  all  of  the  authors  of  this  paper  stud¬ 
ied  the  data  and  determined  what  they  believed  the  PDR  to 
be.  We  agreed  on  the  PDR  for  all  but  one  example,  DEOS, 
where  some  authors  thought  the  value  was  10  and  others 
thought  15. 

Table  2  shows  the  relatively  small  number  of  parallel 
nodes  corresponding  to  the  PDR;  in  all  cases,  we  found
this  number  to  be  at  most  20.  The  table  also  shows  the
speedup  of  the  PDR  configuration  of  PRSS  over  the  default 
search;  speedups  for  artifacts  whose  search  did  not  finish 
are  considered  lower  bounds.  These  indicate  the  benefits 
of  using  PRSS.  The  variation  in  speedups  is  enormous,  but 
all  of  the  examples  exhibit  non-trivial  speedup  and  many 
have  an  order  of  magnitude  or  more  speedup.  For  some 
examples,  it  is  clear  that  there  are  more  efficient  searches
that  could  be  performed  with  more  nodes  than  the  number
we  identified  as  the  PDR.  For  example,  BoundedBuffer
and  ReplicatedWorkers  achieve  speedups  of  33.8  at  50
nodes  and  36231.6  at  25  nodes,  respectively.
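
The speedup figures in Table 2 appear to be the ratio of the states explored by the default search to the mean states explored at the PDR configuration. The sketch below recomputes that ratio from the table's own numbers; reading the column this way is an inference from the values, and the tuple layout and function name are illustrative only.

```python
# (artifact, default states, mean states at PDR, reported speedup,
#  True if the default search did not finish, so the speedup is a lower bound)
TABLE2 = [
    ("BoundedBuffer",     2603200,  313290,     8.3, True),
    ("Daisy",              101816,   37715,     2.7, False),
    ("DEOS",               260039,  143860,     1.8, False),
    ("Elevator",         11743698,  116960,   100.4, True),
    ("RaxExtended",       3470398,     176, 19718.1, True),
    ("ReplicatedWorkers", 6231840,  226790,    27.5, True),
    ("RWNoDeadLckCk",     3147356,    1847,  1704.0, False),
]


def speedup(default_states, mean_states):
    """Ratio of default-search cost to mean PRSS cost, both in states."""
    return default_states / mean_states


for name, default, mean, reported, lower_bound in TABLE2:
    prefix = ">" if lower_bound else ""
    print(f"{name}: {prefix}{speedup(default, mean):.1f} (reported {prefix}{reported})")
```

The recomputed ratios agree with the reported values to within 0.1, which is consistent with the mean-state entries having been rounded for presentation.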

5.3.  RQ3  -  Fault  Detection 

In  choosing  the  artifacts  for  this  study,  our  goal  was  to 
choose  programs  that  contain  hard-to-find  defects.  Of  the
seven  artifacts  selected,  four  have  defects  that  were  not
detected  using  the  default  search  order  because  the  search  either
timed  out  or  ran  out  of  memory.  For  all  of  those  artifacts,  we
were  able  to  use  PRSS  to  consistently  find  the  error  given  a 
sufficient  level  of  parallelism. 

For  Elevator  and  RaxExtended,  all  configurations
of  PRSS  found  the  defect  in  our  experiments  in  all  of
the  50  trials  performed.  For  ReplicatedWorkers  and
BoundedBuffer,  the  error  was  consistently  detected
when  the  degree  of  parallelism  was  increased  to  25  and  50
nodes,  respectively.  We  conclude  that  PRSS  can  be  used  to
detect  an  error  in  artifacts  where  the  default  searcher  fails 
because  of  insufficient  system  resources. 
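
One way to see why a sufficient level of parallelism yields consistent detection is a simple independence model. The sketch below assumes that each randomized search completes (finds the error within its resource bounds) independently with the same probability, and that a PRSS run succeeds when at least one of its searches completes; this is a simplification for illustration, not an analysis performed in the study, and both function names are hypothetical.

```python
def detection_probability(p_complete, nodes):
    """P(at least one of `nodes` independent searches completes)."""
    return 1.0 - (1.0 - p_complete) ** nodes


def nodes_needed(p_complete, target):
    """Smallest node count whose detection probability reaches `target`."""
    if not 0.0 < p_complete <= 1.0:
        raise ValueError("p_complete must be in (0, 1]")
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    nodes = 1
    while detection_probability(p_complete, nodes) < target:
        nodes += 1
    return nodes
```

Under this model, if only half of the individual searches complete, `nodes_needed(0.5, 0.99)` gives 7: the chance that every search in a run fails shrinks geometrically with the number of nodes.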

Since  finding  errors  in  large  multi-threaded  programs 
is  not  cost-effective  using  existing  state-space  search  ap¬ 
proaches,  some  researchers  have  turned  to  more  modular 
approaches  where  an  application  is  broken  into  pieces  and 
those  pieces  are  analyzed  independently  [25,  26].  It  would 
be  interesting  to  explore  the  application  of  PRSS  to  those 
applications  to  see  if  errors  can  be  detected  without  the 
added  costs  associated  with  modular  reasoning,  for  exam¬ 
ple,  through  the  construction  of  environments  that  simulate 
the  calling  context  of  a  program  component. 

6.  Related  Work 

Given  the  computational  cost  of  state-space  search,  it 
is  natural  to  wonder  whether  it  can  be  effectively  paral¬ 
lelized.  Stern  and  Dill  report  on  the  parallelization  of  the 
Murφ  model  checker  [23].  Their  approach  stands  as  the
model  upon  which  all  other  techniques  in  the  literature  are 
based.  They  distribute  a  collection  of  searches  targeting 
portions  of  the  state-space  rooted  at  different  nodes.  A
shared  seen  set  is  used  to  keep  searches  from  performing 
redundant  work.  This  set  must  be  locked  to  ensure  coher¬ 
ent  updates.  The  overhead  of  locking,  and  the  poor  local¬ 
ity  in  the  sub-state-spaces  searched  in  parallel,  cause  this 
algorithm  to  scale  poorly.  Researchers  have  explored  the 
use  of  lock  free  shared  structures,  to  minimize  contention, 
and  dynamic  load  balancing  [18],  but  even  with  those  im¬ 
provements  the  coordination  of  multiple  searches  seems  to 
greatly  limit  scalability.  Our  approach  is  embarrassingly 
parallel,  so  it  has  no  coordination  overhead,  but  it  may  do
arbitrary  amounts  of  redundant  work,  which  reduces  the
useful  parallel  work  it  performs.
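
The contrast with a shared, locked seen set can be sketched on a toy model. In the code below every search keeps a private seen set and differs only in the seed used to shuffle the enabled transitions, so the searches never coordinate and may revisit the same states. The two-counter model, the error state, and all names are hypothetical; this is not the JPF-based implementation used in the study, and the thread pool merely stands in for independent machines.

```python
import random
from concurrent.futures import ThreadPoolExecutor

N = 30              # hypothetical toy model: two counters, each in 0..N
ERROR = (N, 3)      # hypothetical error state


def successors(state):
    """Enabled transitions: either 'thread' may advance its own counter."""
    a, b = state
    nxt = []
    if a < N:
        nxt.append((a + 1, b))
    if b < N:
        nxt.append((a, b + 1))
    return nxt


def dfs_cost(seed=None):
    """Stateful DFS from (0, 0); returns states discovered when ERROR is reached.

    seed=None keeps the fixed (default) transition order; any other seed
    shuffles the enabled transitions with a private random generator.
    """
    rng = random.Random(seed) if seed is not None else None
    seen = {(0, 0)}          # private seen set: nothing shared, nothing locked
    stack = [(0, 0)]
    while stack:
        state = stack.pop()
        if state == ERROR:
            return len(seen)
        succ = successors(state)
        if rng is not None:
            rng.shuffle(succ)
        for s in succ:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return None              # ERROR not reachable (does not happen here)


def prss_cost(seeds):
    """Run one independent randomized search per seed; keep the cheapest."""
    with ThreadPoolExecutor() as pool:
        costs = list(pool.map(dfs_cost, seeds))
    return min(costs)
```

Calling `prss_cost(range(1, 9))` corresponds to an eight-node configuration in this toy setting; the result is the cost of whichever search reaches the error first, and the work done by the other seven is the redundant effort noted above.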

The  idea  of  using  randomization  in  state-space  search 
dates  back  to  West  [28],  who  showed  that  it  can  be
effective  in  finding  bugs  in  large  protocols.  It  is
supported  in  modern  tools;  for  example,  JPF  has  had  the
RandomOrderScheduler  component  for  several  years,
but  its  combination  with  parallel  execution  had  not  been
explored  or  validated  empirically  until  our  work.

Randomization  in  state  space  search  can  be  used  to  con¬ 
trol  the  schedulings  explored,  as  in  our  work,  or  to  control 
which  states  are  stored  in  the  seen  set.  To  control  mem¬ 
ory  requirements,  techniques  like  bit-state  hashing  [15]  ran¬ 
domly  drop  states  from  the  seen  set.  While  lossy,  this  ap¬ 
proach  can  scale  analyses  to  very  large  problems.  In  [14], 
multiple  bit-state  hashing  runs  are  explored  in  parallel  to  re¬ 
duce  the  time  to  find  errors.  This  approach  is  very  similar  to 
ours  except  that  our  approach  is  not  lossy,  since  the
underlying  randomization  technique  is  not  lossy.  While  [14]
describes  the  technique's  use  for  large  systems,  there  is  no
empirical  study  of  the  performance  improvements  seen  when
parallelization  and  randomization  are  used  together  in  dif¬ 
ferent  configurations. 
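
To illustrate what "lossy" means for a seen set, the sketch below stores one bit per hash value rather than the state itself. Two distinct states that hash to the same bit are indistinguishable, so the second is treated as already visited and is never explored. This is a generic single-hash illustration under our own assumptions (class name, hash function, sizes), not the implementation described in [15].

```python
import hashlib


class BitStateSet:
    """Lossy seen set: one bit per hash bucket, no stored states."""

    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)

    def _index(self, state):
        # Hash the state's textual form; any well-mixed hash would do.
        digest = hashlib.sha256(repr(state).encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, state):
        """Mark state as seen; return True only if its bit was not yet set."""
        i = self._index(state)
        byte, mask = i >> 3, 1 << (i & 7)
        was_set = bool(self.bits[byte] & mask)
        self.bits[byte] |= mask
        return not was_set


# With only 8 bits, at most 8 of 100 distinct states can look "new".
tiny = BitStateSet(8)
new_states = sum(tiny.add(i) for i in range(100))
```

A randomized stateful search with an exact seen set, as in our approach, never discards a state in this way; it pays for that with memory proportional to the states stored.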

Recent  work  has  developed  the  concept  of  Monte  Carlo 
Model  Checking  [11],  which  computes  a  bound  on  the
probability  that  randomized  walks  of  the  state-space,  beyond  a
specified  value,  will  find  an  error;  this  is  not  a  bound  on  the 
probability  that  an  error  exists.  We  make  no  attempt  to  es¬ 
timate  the  benefits  of  additional  randomization,  but  instead 
observe  empirically  that  relatively  small  numbers  of  sam¬ 
ples  seem  sufficient  for  error  detection. 

Randomization  in  software  testing  is  an  old  idea  [2]  that 
has  proven  to  be  effective  in  practice  [8].  Approaches  for 
randomizing  the  input  space  of  a  program  under  constraints 
designed  to  improve  error-detection  have  been  proposed 
[3,  20]  and  they  seem  to  be  effective.  These  techniques 
do  not  target  concurrent  executions  explicitly  and  make  no 
attempt  to  randomize  the  scheduler’s  behavior.  This  may 
make  them  less  effective  at  revealing  concurrency  errors. 
It  would  be  interesting  to  adapt  the  intuition  of  these  ap¬ 
proaches  to  randomized  scheduling  [24].  Our  approach 
considers  program  input  to  be  fixed,  and  rather  than  per¬ 
forming  a  stateless  search,  we  randomize  a  stateful  search. 
Our  experience  suggests  that  both  the  randomness  and  the 
statefulness  are  key  ingredients  to  its  success.

7.  Conclusions  and  Future  Work 

We  have  presented  a  simple  and  cost-effective  technique 
for  amplifying  the  benefits  of  existing  optimizations  for 
state-space  search  targeted  at  error  detection.  We  believe 
this  approach  to  be  broadly  compatible  with  explicit-state
model  checking  approaches  and  applicable  across  a  wide 
range  of  programs.  Further  empirical  studies  would  be
valuable  in  validating  this  belief,  but  the  results  from  our  study
suggest  that  a  practical  and  significant  cost-reduction  can  be 
achieved  in  the  analysis  of  programs  with  large  state-spaces. 

In  the  future,  we  would  like  to  explore  the  impact  of 
parallelization  and  randomization  on  other  forms  of  state- 
space  search,  such  as  variants  of  breadth-first  and  heuristic 
searches.  Heuristics  tend  to  focus  a  search  on  portions  of 
the  state  space,  but  when  the  heuristic  scoring  function  is 
discrete,  multiple  enabled  transitions  can  receive  the  same 
score.  Given  that  randomization  appears  effective  in  im¬ 
proving  the  performance  of  state-space  search  over  the  de¬ 
fault  order,  it  may  also  prove  effective  in  shuffling  the  order 
in  which  ties  are  broken  and  thereby  speed  error  detection 
for  heuristic  searches  as  well. 
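
The tie-breaking idea can be sketched as follows. The code assumes a heuristic in which a higher score is better and picks uniformly at random among the enabled transitions that share the best score; the function name, the score table, and the transition labels are hypothetical and serve only to illustrate the proposal.

```python
import random


def pick_transition(enabled, score, rng):
    """Choose a best-scoring transition, breaking ties at random."""
    best = max(score(t) for t in enabled)
    tied = [t for t in enabled if score(t) == best]
    return rng.choice(tied)


# Hypothetical discrete scores: t2 and t3 tie for the best value.
scores = {"t1": 2, "t2": 5, "t3": 5}
rng = random.Random(1)
picks = [pick_transition(list(scores), scores.get, rng) for _ in range(200)]
```

Running several such searches with different seeds would explore different orderings among equally ranked transitions while still following the heuristic whenever it expresses a strict preference.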